Decision Trees are hierarchical supervised learning algorithms
They primarily help us with:
- Classification and Regression
- Non-linear & Non-parametric Modelling
- Feature Selection
How Do We “Grow” a Tree?
Trees are grown with a greedy divide-and-conquer strategy that searches for ways to split the dataset based on different conditions (feature, threshold).
They learn simple if-else decision rules inferred from the data features.
The algorithm selects the “best” condition based on a specific score.
The algorithm then recursively splits the resulting subsets of data.
When no satisfying condition is found, the node becomes a leaf (see the sketch below).
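As a rough illustration, the snippet below fits a small scikit-learn tree on the iris dataset (a stand-in dataset, not the one used elsewhere in this lecture) and prints the if-else rules the greedy algorithm selected:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree and inspect the learned (feature, threshold) conditions
X_iris, y_iris = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_iris, y_iris)

# export_text prints the recursive if-else structure of the fitted tree
print(export_text(tree, feature_names=load_iris().feature_names))
```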
DecisionTreeClassifier
Select the split that results in the most homogeneous sub-node
Gini Index
The Gini index measures the ability of each feature to separate the data. It quantifies the impurity of each node on a [0, 1] scale - the lower, the better.
\[ Gini(node) = 1 - \sum_{i} p_i^2 \]
where \(p_i\) is the ratio of the observations in class \(i\) to the total number of observations remaining at a given node.
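As a minimal sketch (the function name and example labels are purely illustrative), the Gini impurity of a node can be computed directly from the class labels that reach it:

```python
import numpy as np

def gini(y):
    """Gini impurity of a node: 1 - sum_i p_i^2 over the classes present in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini(np.array([0, 0, 0, 0])))  # 0.0 -> pure node (best)
print(gini(np.array([0, 0, 1, 1])))  # 0.5 -> evenly mixed two-class node
```

When comparing candidate splits, the impurities of the resulting sub-nodes are weighted by their sizes.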
Predicting
A new point is passed through the tree from top to bottom until it reaches a leaf. The most represented class in that leaf is the predicted class for that point.
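Continuing the hypothetical iris sketch from above, `predict` routes a new point down the fitted tree and returns the majority class of the leaf it lands in:

```python
# A single (hypothetical) measurement, routed top-down through the fitted tree
new_point = [[5.1, 3.5, 1.4, 0.2]]

print(tree.apply(new_point))    # index of the leaf the point falls into
print(tree.predict(new_point))  # majority class of that leaf
```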
First, let’s split the dataset into train, test and validation sets
Splitting the Data
```python
from sklearn.model_selection import train_test_split

# Split data into train, test and validation sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,      # TEST = 30%
    random_state=42
)

# Use the same function again to carve the validation set out of the test set
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test,
    test_size=0.5,      # TEST = 15%, VAL = 15%
    random_state=42
)
```
Then, we plug the data into the XGBRegressor!
```python
from xgboost import XGBRegressor

xgb_reg = XGBRegressor(max_depth=10, n_estimators=100, learning_rate=0.1)

xgb_reg.fit(
    X_train, y_train,
    # evaluate loss at each iteration
    eval_set=[(X_train, y_train), (X_val, y_val)],
    # stop iterating when eval loss increases 5 times in a row
    early_stopping_rounds=5
)

y_pred = xgb_reg.predict(X_val)
```
How does it work?
An initial prediction is made on the whole dataset with a single tree.
Residuals are computed from the predicted and the observed values. A decision tree is then fitted to these residuals.
The predictions are updated with this tree's output, and the new residuals are used to construct another tree.
This process is repeated until the residuals stop decreasing or for a specified number of iterations. Each subsequent tree learns from the previous trees (the Boosting principle) - see the minimal loop sketched below.
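The loop below is a minimal sketch of that idea, using plain gradient boosting with squared error and scikit-learn trees. It is not XGBoost's exact algorithm, and every name in it is illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    base = y.mean()                       # initial prediction for every sample
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction        # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)  # each tree corrects its predecessors
        trees.append(tree)
    return base, trees

def predict_boosted(X, base, trees, learning_rate=0.1):
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```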
Optimization: Parameters and methods
XGBoost uses the following parameters and methods to optimize the algorithm and provide better results and performance:
Regularization: a regularization parameter (lambda) is used while calculating the similarity scores, to reduce the sensitivity to individual data points and avoid overfitting.
Pruning: a tree complexity parameter (gamma) is compared against the gains; branches whose gain is smaller than gamma are removed. This prevents overfitting by trimming unnecessary branches and reducing the depth of the trees.
Weighted quantile sketch: instead of testing every possible value as a splitting threshold, only weighted quantiles are used.
Parallel learning: note that XGBoost does not train multiple trees in parallel. Rather, the parallelization happens WITHIN a single tree, where branches are created independently. (See the snippet below.)
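The snippet below is a minimal sketch showing how these knobs appear as `XGBRegressor` constructor arguments; the values are arbitrary illustrations, not tuned recommendations:

```python
from xgboost import XGBRegressor

xgb_reg = XGBRegressor(
    reg_lambda=1.0,    # lambda: L2 regularization applied when computing leaf scores
    gamma=0.1,         # tree complexity parameter: minimum gain required to keep a split
    max_depth=6,
    n_estimators=100,
    learning_rate=0.1,
    n_jobs=-1,         # parallelization happens within each tree, not across trees
)
```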
Pros and Cons of Boosting
👍 Advantages:
Stronger sub-models have more influence on the final decision
Reduces bias
👎 Disadvantages:
Computationally expensive (sequential)
Easily overfits
Sensitive to outliers (too much time spent trying to correctly predict them)
Ensemble Methods Summary
Ensemble learning combines several base algorithms, like Decision Trees
Ensemble methods can be broken down into two categories:
Parallel learners: models trained in parallel; predictions are aggregated
RandomForestRegressor
BaggingRegressor
Sequential learners: models trained sequentially so as to learn from predecessors’ mistakes
AdaBoostRegressor
GradientBoostingRegressor
XGBRegressor
4. Stacking
Training different estimators and aggregating their predictions
Different estimators (KNN, LogReg, etc.) capture different structures in the data
Combining sometimes enhances the predictive power
The results are aggregated either by voting (classification) or averaging (regression)
```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

forest = RandomForestClassifier()
logreg = LogisticRegression()

ensemble = VotingClassifier(
    estimators=[("rf", forest), ("lr", logreg)],
    voting='soft',    # to use predict_proba of each classifier before voting
    weights=[1, 1]    # to equally weight forest and logreg in the vote
)

ensemble.fit(X_moon, y_moon)
plot_decision_regions(X_moon, y_moon, classifier=ensemble)
```
🤔 How do you choose the best weights?
b) Multi-layer Stacking!
Train a final estimator on the predictions of the previous ones
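Instead of hand-tuning weights, scikit-learn's `StackingClassifier` trains a final estimator on the cross-validated predictions of the base estimators. The snippet below is a minimal sketch, reusing the `X_moon`, `y_moon` data from the voting example above:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),  # learns how to combine the base predictions
    cv=5,                                  # base predictions are produced out-of-fold
)
stack.fit(X_moon, y_moon)
```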
Most estimators return \(E(Y|X)\) at prediction time, which can be interpreted as the answer to the question: what is the expected value of the output given the input?
…
When a decision tree is fit, only the sufficient statistics of the target (such as the mean and variance) are stored at each leaf node.
A Quantile Random Forest instead keeps all the target values in the leaf node; at prediction time, these are used to compute empirical quantile estimates (see the sketch below).
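A rough sketch of that idea, using `RandomForestRegressor.apply()` to recover which training targets share a leaf with a new point. Real quantile regression forests weight these values more carefully; all names here are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def leaf_quantile(forest, X_train, y_train, x_new, q=0.9):
    """Empirical q-quantile of the training targets that land in the same
    leaves as x_new, pooled across all trees of a fitted forest."""
    train_leaves = forest.apply(X_train)                 # shape (n_samples, n_trees)
    new_leaves = forest.apply(x_new.reshape(1, -1))[0]   # leaf index per tree for x_new
    pooled = np.concatenate([
        y_train[train_leaves[:, t] == new_leaves[t]]
        for t in range(train_leaves.shape[1])
    ])
    return np.quantile(pooled, q)
```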
LightGBM splits the tree leaf-wise, choosing the leaf with the best fit, whereas other boosting algorithms split the tree depth-wise (level-wise) rather than leaf-wise. In other words, LightGBM grows trees vertically while other algorithms grow trees horizontally.
Advantages of LightGBM
Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up training.
Lower memory usage: replacing continuous values with discrete bins results in lower memory usage.
Compatibility with large datasets: it achieves almost the same results as XGBoost on large datasets, with a significant reduction in training time.
Disadvantages of LightGBM
Overfitting: LightGBM splits the tree leaf-wise, which can lead to overfitting because it produces more complex trees.
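A minimal usage sketch, assuming the `lightgbm` package and the `X_train`/`X_val` split from earlier; the parameter values are illustrative. Constraining `num_leaves` (and optionally `max_depth`) is the usual way to keep leaf-wise growth from overfitting:

```python
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,   # leaf-wise growth is bounded by the number of leaves
    max_depth=-1,    # -1 = no depth limit; set a positive value to restrict tree depth
    max_bin=255,     # number of histogram bins used for continuous features
)
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_val)
```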
NGBoost enables predictive uncertainty estimation with Gradient Boosting through probabilistic predictions.
Probabilistic prediction is the approach where the model outputs a full probability distribution over the entire outcome space, instead of a single point estimate.
How?
By nonparametrically modeling the conditional parameters of the multivariate predictive distribution.
For example, \(Y|X\) may be assumed to follow a univariate Normal distribution where \(\mu\) and \(\sigma\) are taken to be the parameter vector \(\theta\).
The output of NGBoost is \(\theta(X) = (\mu(X), \sigma(X))\).
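A minimal sketch of what this looks like with the `ngboost` package (an assumption about the environment; parameter values are illustrative, and the data names reuse the earlier `X_train`/`X_val` split):

```python
from ngboost import NGBRegressor
from ngboost.distns import Normal

# Fit like an ordinary regressor, but predict a full Normal distribution per sample
ngb = NGBRegressor(Dist=Normal, n_estimators=500, learning_rate=0.01)
ngb.fit(X_train, y_train)

y_dist = ngb.pred_dist(X_val)      # a Normal distribution for each row of X_val
print(y_dist.params["loc"][:5])    # mu(X)
print(y_dist.params["scale"][:5])  # sigma(X)
```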